I Made A Dataset So Dense It Broke My Hard Drive
I have a new dataset. It is called Dense-PRISM. It lives on Hugging Face. It is 164 GB. My hard drive cried when I uploaded it. My internet provider sent me a concerned email. I am proud.
Density is not about size. Density is about information per byte. Dense-PRISM has so much information per byte that bytes are now asking for raises.
The Numbers
Let us talk about scale. Because numbers are fun and also terrifying.
The math works like this. Four thousand ninety-six top tokens logged for every single generated token. Seven hundred ninety-nine prompts. Average response length times four thousand ninety-six times seven hundred ninety-nine equals total training signals.
Six hundred fifty-four million training signals. From seven hundred ninety-nine prompts. That is the power of density. That is the curse of density. My hard drive understands the curse personally.
What Is In The File
Each entry contains the standard conversation format. User asks. Assistant answers. Then comes the gold. For every token in the response, you get the top 4096 alternatives with their log probabilities.
That ellipsis represents four thousand ninety-three more tokens. Multiply that by every token in every response. You get Dense-PRISM. You get a file that makes file explorers hesitate.
Why 4096
Why not 50? Why not 100? Why not a reasonable number that does not break storage systems? Because 4096 is a power of two. Because it feels technical. Because I wanted to see what would happen.
Also, 4096 tokens covers a meaningful slice of the vocabulary. It shows the model not just the top choices but the entire neighborhood of possibilities. It teaches semantic distance through probability gradients.
A model trained on Dense-PRISM knows that "Why?" and "Hey! whats up?" live in different probability neighborhoods. It learns tone through math. It learns style through statistics.
The Free Part
Yes, it is free. MIT license. Download it. Fork it. Train your tiny models on it. Make something smarter than my tiny models. Please. My GPU needs the competition.
I could have put it behind a paywall. I could have made it exclusive. I did not. Open source is the point. Sharing is the point. Watching other people build cool things with my weird datasets is the point.
Storage Considerations
164 GB is large. It will take time to download. It will take space to store. It will take patience to parse. This is the cost of density.
I learned these tips the hard way. My RAM cried. My swap file screamed. My patience evaporated. You do not need to repeat my mistakes. Learn from my pain.
What This Teaches
Standard distillation teaches what to say. Dense-PRISM teaches how to choose what to say. The student model sees the probability landscape. It understands why certain tokens fit certain contexts. It learns the shape of appropriate responses.
A model trained on this knows that formal questions deserve formal answers. It knows that casual greetings invite casual responses. It knows the distance between tones. It learns through exposure to the full spectrum of possibility.
The Math Again
Let us return to the formula because it is beautiful in a terrifying way.
Each prompt becomes a universe of possibilities. Each token becomes a lesson in probability. Each logprob becomes a teacher. This is distillation at maximum density.
Who Should Use This
People training small models. People who want their models to understand nuance. People who have large hard drives. People who enjoy watching progress bars move very slowly.
If you are training a model under 1B parameters, Dense-PRISM can teach it to speak with more intention. If you are training a model under 100M parameters, it can teach it to choose words with more care. If you are training a model under 10M parameters, it might teach it to form coherent sentences. Progress is relative.
Final Thoughts
Dense-PRISM exists. It is 164 GB. It has 4096 top tokens per generated token. It has 799 prompts. It is free. It is dense. It is available now on Hugging Face.
Download it if you dare. Train on it if you can. Make something better than my confused tiny models. That is the goal. That is the dream. That is Dense-PRISM.